Table: Training hyper-parameters per setting (finetune: 75 epochs; DPKD: 25+50 epochs, learning rate 0.0001).
In this section, we present all the hyper-parameters used for training our models. We run our experiments using PyTorch's distributed training on an Azure ML NVIDIA DGX-2. Following [75, 34], we clip the gradient norm to 1 and set the batch size to 1024 in all experiments; a sketch of this training configuration is given below. Structured pruning can be performed by pruning attention heads, pruning encoder units, or pruning the embedding layer (see the head-pruning sketch at the end of this section). KD is quite different from our approach.
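As a concrete illustration of this setup, the following is a minimal PyTorch sketch of a distributed training loop with the gradient norm clipped to 1 and the learning rate of 0.0001 from the table above. The optimizer choice (AdamW), the `model` and `loader` arguments, and the Hugging-Face-style `.loss` output are assumptions for illustration, not details specified in the text.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train(model, loader, epochs):
    # Initialize distributed training (launched e.g. via torchrun).
    dist.init_process_group(backend="nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)
    model = DDP(model.cuda(), device_ids=[local_rank])

    # lr matches the 0.0001 in the hyper-parameter table; the optimizer
    # choice (AdamW) is an assumption, not stated in the text.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(epochs):
        for batch in loader:  # global batch size 1024 across all GPUs
            loss = model(**batch).loss  # assumes an HF-style model output
            optimizer.zero_grad()
            loss.backward()
            # Clip the gradient norm to 1, as in all experiments.
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
```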
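For the structured pruning options mentioned above, one concrete way to prune attention heads is the `prune_heads` utility in Hugging Face Transformers. The checkpoint and the specific head indices below are arbitrary illustrative choices, not settings used in this work.

```python
from transformers import BertModel

# Illustrative structured pruning: remove attention heads 0 and 2 in
# layer 0 and head 5 in layer 2 of a BERT encoder.
model = BertModel.from_pretrained("bert-base-uncased")
model.prune_heads({0: [0, 2], 2: [5]})

# The pruned layer's attention projections are physically smaller now:
# layer 0 drops from 12 heads to 10.
print(model.encoder.layer[0].attention.self.num_attention_heads)  # 10
```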